This script reprocesses 10X single-cell RNA-seq data from Cheung et al. 2018 (GSE120410) using Seurat v3. Standard filtering and normalization of data is performed (following Seurat PBMC analysis workflow). Cell-type markers are identified and clusters are manually annotated using pituitary cell-type markers curated from the original publication (Cheung et al. 2018) as well as rat anterior pituitary single-cell study (Fletcher et al. 2019), and mouse posterior pituitary single-cell study (Chen et al. 2019). Using the reprocessed scRNA-seq data, enrichment for genes in co-expression modules within different single-cell clusters is assessed. Enrichment is calculated based on expression of a given gene within a cell cluster compared to all other cell clusters using a one-tailed Kolmogorov-Smirnov (KS) test. A gene is considered enriched if FDR < 0.05 and FC > 0 in a cell cluster compared to other cell clusters.
To run: Set working directory “To Source File Location”.
#### Libraries ####
library(dplyr)
library(Seurat)
library(knitr)
library(biomaRt)
library(igraph)
library(ggraph)
library(intergraph)
# library(rowr)
library(scales)
library(EDASeq)
library(pheatmap)
library(LaCroixColoR)
library(grid)
library(plyr)
library(ggpubr)
library(reshape2)
library(ggplot2)
library(colorspace)
library(ggrepel)
library(openxlsx)
source("cheung_sc_reprocess_2021-03-22_functions_rmd_ver.R")
#### Load in scRNA-seq data ####
pit_data <- Read10X(data.dir = "input_files/GSE120410_Cheung_2018/")
pit_seurat <- CreateSeuratObject(counts = pit_data,
project = "cheung_pit",
min.cells = 3,
min.features = 200)
pit_seurat
## An object of class Seurat
## 18441 features across 13620 samples within 1 assay
## Active assay: RNA (18441 features, 0 variable features)
In each cell, number of counts, number of RNA features (unique genes), and mitochondrial content is assessed. Based on the scatter plots, cells are filtered for: nFeature_RNA > 200 & nFeature_RNA < 7000 & percent.mt < 20. This results in 13565/13620 cells retained.
#### Cell filtering ####
# Visualize QC metrics as a violin plot and use to filter cells by unique feature counts
pit_seurat[["percent.mt"]] <- PercentageFeatureSet(pit_seurat, pattern = "^mt-")
print(VlnPlot(pit_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3))
Visualization of QC metrics pre-filtering
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pit_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pit_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
print(plot1+plot2)
Visualization of QC metrics pre-filtering
# Apply filter to data based on QC metrics
pit_seurat <- subset(pit_seurat, subset = nFeature_RNA > 200 & nFeature_RNA < 7000 & percent.mt < 20)
pit_seurat # After subsetting with filter
## An object of class Seurat
## 18441 features across 13565 samples within 1 assay
## Active assay: RNA (18441 features, 0 variable features)
print(VlnPlot(pit_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3))
Visualization of QC metrics after filtering
plot1 <- FeatureScatter(pit_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pit_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
print(plot1+plot2)
Visualization of QC metrics after filtering
Apply global-scaling normalization of the feature expression measurements for each cell by the total expression. Counts are then mulitplied by a scaling factor (10,000) and then log-transformed.
#### Data global normalization ####
pit_seurat <- NormalizeData(pit_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
Genes (referred to as features) exhibiting high cell-to-cell variation are identified. These cells are likely highly expressed in some cells but lowly expressed in others. Focusing on these genes downstream helps highlight important biological signal in the dataset. Selected features will be used in downstream analyses like PCA. Here, we used 3000 variable features based on above scatter plots.
#### Feature (gene) selection ####
pit_seurat <- FindVariableFeatures(pit_seurat, selection.method = "vst", nfeatures = 3000) # based on violin plot
# Identify the 20 most highly variable genes
top20 <- head(VariableFeatures(pit_seurat), 20)
top20
## [1] "Pomc" "Apoe" "Hbb-bs" "Hba-a1" "Hba-a2" "Cd74" "Lhb" "Dcn"
## [9] "Cga" "H2-Ab1" "Hbb-bt" "Mgp" "H2-Aa" "Mia" "H2-Eb1" "C1qa"
## [17] "Igfbp3" "C1qb" "Tshb" "Fshb"
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pit_seurat)
plot2 <- LabelPoints(plot = plot1, points = top20, repel = TRUE)
print(plot1 + plot2)
Linear transformation ‘scaling’ prior to dimensional reduction like PCA shifts expression of each gene so mean expression across cells is 0. Scales gene expression for each gene so that variance across cells is 1 (so that highly expressed genes do not dominate).
#### Data scaling ####
all.genes <- rownames(pit_seurat)
pit_seurat <- ScaleData(pit_seurat, features = all.genes) # pit_seurat[["RNA"]]@scale.data
3000 variable features chosen previously are used for linear dimensionality reduction with PCA. Genes associated with the top 5 PCs are visualized. Expression of genes associated with the top 15 PCs in 500 cells displaying “extreme” expression from both ends of the spectrum are shown in the heatmap. Heatmap can be used to visualize sources heterogeneity in the dataset and inform the which PCs to include in downstream analyses.
#### Linear dimension reduction (PCA) ####
pit_seurat <- RunPCA(pit_seurat, features = VariableFeatures(object = pit_seurat))
print(pit_seurat[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: Ifitm3, Timp3, Tm4sf1, S100a16, Anxa2
## Negative: Dlk1, Pcsk1n, Nnat, Resp18, Scg5
## PC_ 2
## Positive: Mia, Cyp2f2, Aldoc, Lcn2, Folr1
## Negative: Plvap, Kdr, Ctla2a, Egfl7, Emcn
## PC_ 3
## Positive: Dcn, Col3a1, Col1a1, Mfap4, Olfml3
## Negative: Plvap, Spint2, Kdr, Ctla2a, Egfl7
## PC_ 4
## Positive: Ctss, Tyrobp, C1qc, C1qb, C1qa
## Negative: Dcn, Bgn, Itih5, Col3a1, Col1a1
## PC_ 5
## Positive: Birc5, Pbk, Spc25, Cdca3, Cdk1
## Negative: Dapl1, Cyp2f2, Mia, Aldoc, Lcn2
print(VizDimLoadings(pit_seurat, dims = 1:5, reduction = "pca"))
Genes associated with top 5 principal components.
DimPlot(pit_seurat, reduction = "pca")
Top 2 principal components of data.
DimHeatmap(pit_seurat, dims = 1:15, cells = 500, balanced = TRUE)
Expression of genes associated with the first 15 principal components.
Cells are clustered based on their PCA scores where each PC is a “metafeature” that combines info across correlated feature set. To determine the number of PCs to use, an Elbow Plot is generated. The elbow plot ranks the principle components based on % variance explained by each one. The point at which we observe an ‘elbow’ or decrease in steepness of the line suggests that the majority of true signal is captured before this point. Here we observe the elbow around PC15, hence we will use 15 PCs for our downstream analysis.
#### Elbow plot ####
print(ElbowPlot(pit_seurat))
Elbow plot ranking principal components.
Graph-based clustering approach where genes which similar feature expression patterns have edges drawn between them. Partition of graph into “communities” where nodes/genes are highly interconnected is then applied after. First, K-nearest neighbour (KNN) graph based on euclidean distance in PCA space is constructed. Second, edge weights between cells are refined based on shared overlap in ‘local neighbourhood’ (Jaccard similarity). This second step takes in as input the previously defined dimensionality of the dataset (first 15 PCs).
#### Cell clustering ####
pit_seurat <- FindNeighbors(pit_seurat, dims = 1:15)
To cluster the cells, Louvain algorithm is applied. Cells are grouped together iteratively, with goal of optimizing standard modularity function. Resolution increases for larger datasets and increases the number of clusters resulting. Here, 19 cell clusters result.
#### Louvain algorithm ####
pit_seurat <- FindClusters(pit_seurat, resolution = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 13565
## Number of edges: 462073
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8725
## Number of communities: 19
## Elapsed time: 2 seconds
head(Idents(pit_seurat), 5)
## AAACCTGAGCTAACTC-1 AAACCTGCAAACTGCT-1 AAACCTGCAATGGAAT-1 AAACCTGCACGCATCG-1
## 5 12 16 6
## AAACCTGCAGCTCGAC-1
## 8
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
UMAP reduce dimensions of the data and allow for visualization of the dataset in 2D space. These algorithms try to learn the underlying manifold of the data to try to place similar cells together in low-dim space.
#### Non-linear dimensional reduction ####
pit_seurat <- RunUMAP(pit_seurat, dims = 1:15)
dir.create("output_files")
p1 <- DimPlot(pit_seurat, reduction = "umap", label = T)
p2 <- DimPlot(pit_seurat, reduction = "umap", label = F)
pdf("output_files/seurat_v3_cheungpit_umap.pdf", width = 7, height = 6)
print(p1)
print(p2)
tmp <- dev.off()
print(p1+p2)
UMAP visualization of cell clusters.
This step may take longer due to file size (~2 GB)
saveRDS(pit_seurat, file = "output_files/seurat_v3_pitCheung.rds")
Differential expression of genes within different clusters is tested to identify gene markers for each cluster. Positive gene markers are identified for each cluster. Only genes which are detected in >25% in either one of the two clusters of a given comparison are used. The default Wilcoxon test is used with logFC threshold of 0.25. Violin plots and UMAP are used to visualize known anterior hormone-producing cell-type markers in each cluster. Other pituitary cell-type markers are also visualized (pituicytes, FSCs, endothelial cells). Finally, the top 10 expressing genes in each cluster is plotted as a heatmap.
#### Find cluster biomarkers ####
# pit_seurat <- readRDS("output_files/seurat_v3_pitCheung.rds")
Idents(pit_seurat) <- pit_seurat$seurat_clusters
pit_markers <- FindAllMarkers(pit_seurat, only.pos = T, min.pct = 0.25)
kable(head(pit_markers), caption = "Example cluster gene markers identified.")
| p_val | avg_logFC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
|---|---|---|---|---|---|---|---|
| Gh | 0 | 0.8956397 | 1.000 | 1.000 | 0 | 0 | Gh |
| Dapl1 | 0 | 0.5608014 | 0.827 | 0.393 | 0 | 0 | Dapl1 |
| Dlk1 | 0 | 0.4338954 | 1.000 | 0.861 | 0 | 0 | Dlk1 |
| Ghrhr | 0 | 0.2738138 | 0.718 | 0.328 | 0 | 0 | Ghrhr |
| Cidea | 0 | 0.2943796 | 0.622 | 0.299 | 0 | 0 | Cidea |
| Snhg18 | 0 | 0.3136085 | 0.934 | 0.667 | 0 | 0 | Snhg18 |
saveRDS(pit_markers, "output_files/seurat_v3_cheung_etal_2018_allposmarkers.rds")
write.table(pit_markers, "output_files/seurat_v3_cheungpit_allposmarker_genes.txt",
sep = "\t", col.names = T, row.names = F, quote = F)
dir.create("output_files/seurat_marker_plots")
# Gh = somatotrope (0,1,4,5,13)
# Prl = lactotrope (2,3,6)
# Lhb = gonadotrope (7)
# Fshb = gonadotrope (7)
# Tshb = thyrotrope (8)
# Pomc = corticotrope (12,14,15) anterior lobe
# Pax7 = melanotrope (14) intermediate lobe
# Sox2 = stem-cells (9,18)
# Pou1f1 = Pou1f1 expressing (0,1,2,3,4,5,6,8,10,13)
p <- VlnPlot(pit_seurat, features = c("Gh", "Prl", "Lhb", "Fshb", "Tshb", "Pomc", "Pax7", "Sox2", "Pou1f1"), pt.size = 0.5)
print(p)
Violin plot of anterior pituitary gene marker expression.
pdf("output_files/seurat_marker_plots/cheungpit_anterior_markers_violin.pdf", width = 13, height = 8)
print(p)
tmp <- dev.off()
# Col25a1*, Mest*, Srebf1*, Nkx2-1 = Pituicyte (posterior) (18)
# Pcsk2*, Scg2*, Chga* = Intermediate lobe cells/melantotropes (14)
# Ctss*, C1qa^, Cx3cr1* = Macrophage/microglia/WBC (17)
# S100b&, Fxyd1&, Gstm2&, Capn6& = Folliculostellate cells (FSC) (9,16)
# Emcn*, Flt1*, Plvap*, Pecam1 = Vascular endothelia (11)
# Ogn*, Lum*, Dcn* = VLMC (vascular and leptomeningeal cells) (16)
# Col1a1 = Connective tissue (16)
# Hbb-bt = RBC
# * identified from Posterior pituitary paper bioRxiv Chen et al 2019 (3 mo mice)
# ^ identified from both bioRxiv Chen et al 2019 and Cheung et al 2018
# & identified from Fletcher et al. 2019
p <- VlnPlot(pit_seurat, features = c("Col25a1", "Nkx2-1", # Posterior
"Pcsk2", "Col1a1", # Intermediate/melano
"Ctss", "C1qa", # Immune cells/WBC
"Fxyd1", "Gstm2", # FSC/Stem cells
"Emcn", "Pecam1", # Vascular endothelia
"Col1a1", "Dcn", # Connective tissue
"Hbb-bt") # RBC
, pt.size = 0.5)
print(p)
Violin plot of other pituitary gene marker expression.
pdf("output_files/seurat_marker_plots/cheungpit_other_markers_violin.pdf", width = 13, height = 6)
print(p)
tmp <- dev.off()
p <- FeaturePlot(pit_seurat, features = c("Gh", "Prl", "Lhb", "Fshb", "Tshb", "Pomc", "Pax7", "Sox2", "Pou1f1",
"Nkx2-1", "Pecam1", "Col1a1", "Hbb-bt", "C1qa",
"Fxyd1"), reduction = "umap")
print(p)
UMAP of pituitary gene markers.
pdf("output_files/seurat_marker_plots/cheungpit_markers_umap.pdf", width = 13, height = 8)
print(p)
tmp <- dev.off()
top10 <- pit_markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
p <- DoHeatmap(pit_seurat, features = top10$gene) + NoLegend()
print(p)
Heatmap of top 10 gene markers in each cell cluster.
pdf("output_files/seurat_marker_plots/cheungpit_TOP10markers_heatmap.pdf", width = 12, height = 12)
print(p)
tmp <- dev.off()
Clusters on are manually labelled and grouped based on curated known marker genes.
#### Manual cluster labelling ####
current_cluster_ids <- 0:18
new_cluster_ids <- c("Somatotropes", #0
"Somatotropes",#1
"Lactotropes",#2
"Lactotropes",#3
"Somatotropes",#4
"Somatotropes",#5
"Lactotropes", #6
"Gonadotropes", #7
"Thyrotropes", #8
"Stem-cells (Sox2)/FSC", #9
"Proliferating Pou1f1", #10
"Vascular Endothelia", #11
"Corticotropes", #12
"Somatotropes", #13
"Melanotropes (intermediate)", #14
"Corticotropes", #15
"Connective tissue/VLMC", #16
"Immune cells", #17
"Pituicytes (posterior)") #18
pit_seurat_manual <- pit_seurat
names(new_cluster_ids) <- levels(pit_seurat_manual)
pit_seurat_manual <- RenameIdents(pit_seurat_manual, new_cluster_ids)
Gene markers are reidentified using manually grouped cell clusters. The same parameters are run with FindAllMarkers. Additionally, both positive and negative gene markers are identified for downstream analyses purposes (scMappR).
#### Manual marker identification ####
man_markers <- FindAllMarkers(pit_seurat_manual, only.pos = T, min.pct = 0.25)
kable(head(man_markers), caption = "Example manual cluster gene markers identified.")
| p_val | avg_logFC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
|---|---|---|---|---|---|---|---|
| Gh | 0 | 1.9514522 | 1.000 | 1.000 | 0 | Somatotropes | Gh |
| Dapl1 | 0 | 0.7589953 | 0.735 | 0.220 | 0 | Somatotropes | Dapl1 |
| Dlk1 | 0 | 0.7502002 | 0.999 | 0.782 | 0 | Somatotropes | Dlk1 |
| Ghrhr | 0 | 0.5354975 | 0.681 | 0.133 | 0 | Somatotropes | Ghrhr |
| Snhg18 | 0 | 0.4780746 | 0.892 | 0.548 | 0 | Somatotropes | Snhg18 |
| Ppp1r14a | 0 | 0.3450993 | 0.608 | 0.172 | 0 | Somatotropes | Ppp1r14a |
write.table(man_markers, "output_files/seurat_v3_cheung_etal_manual_anno_cell_cluster_allposmarkers.txt", sep = "\t", quote = F,col.names = T, row.names = F)
man_markers <- FindAllMarkers(pit_seurat_manual, only.pos = F, min.pct = 0.25)
write.table(man_markers, "output_files/seurat_v3_cheung_etal_manual_anno_cell_cluster_allmarkers.txt", sep = "\t", quote = F,col.names = T, row.names = F)
UMAPs are revisualized with manually labelled clusters.
#### Manual cell cluster UMAP ####
p1 <- DimPlot(pit_seurat_manual, reduction = "umap", label = T)
p2 <- DimPlot(pit_seurat_manual, reduction = "umap", label = F)
print(p1+p2)
pdf("output_files/cheungpit_umap_manual_celltypes.pdf", width = 7, height = 6, useDingbats = F)
print(p1)
print(p2)
tmp <- dev.off()
UMAP visualization of manually labelled cell clusters.
saveRDS(pit_seurat_manual, "output_files/seurat_v3_cheung_etal_manual_cell_anno.rds")
Co-expression module gene data is loaded in. A colour palette is made based on the cell-types.
#### Enrichment of module genes in single cell populations ####
seurat <- pit_seurat
seuratman <- pit_seurat_manual
clusters <- seurat@active.ident
manclusters <- seuratman@active.ident
mod_utr <- readRDS("input_files/pit_utrseq_coexpression_modules_annotation.rds")
mod_utr$modules <- factor(mod_utr$modules)
celltypes <- new_cluster_ids
# Create colour palette for heatmap annotation of cell types
ggcolours <- hue_pal()(12)
names(ggcolours) <- levels(manclusters)
celltypes_col <- celltypes
for(i in 1:length(celltypes_col)) {
celltypes_col[i] <- ggcolours[celltypes_col[i]]
}
names(celltypes_col) <- as.character(celltypes)
KS test is run and FDR is calculated for each module gene in each cell cluster (not manually labelled). Genes with enriched with FC > 0, FDR adj < 0.05 in at least one cell cluster compared to all other clusters are retained. Heatmap is plotted with FDR value for each module gene retained in each cell cluster.
#Run KS test and calculate FDR for each module gene
ks_res <- lapply(levels(mod_utr$modules), function(x) calc_stat_test(mod_utr[mod_utr$modules == x, "genename"], "ks", seurat, clusters))
ks_respval <- lapply(ks_res, function(x) na.exclude(x[["pval"]]))
num_comparisons <- sum(unlist(lapply(ks_respval, function(x) print(as.numeric(dim(x)[1]) * as.numeric(dim(x)[2])))))
## [1] 7410
## [1] 2774
## [1] 2299
## [1] 1843
## [1] 1425
## [1] 1197
## [1] 1349
## [1] 1178
## [1] 798
#Calculate FDR adjusted pvalue and filter for minimum FDR adj < 0.05 in each row
ks_resfdr <- lapply(ks_respval, function(x) matrix(p.adjust(as.matrix(x), method = "fdr", n = sum(num_comparisons)), nrow = nrow(x), ncol = ncol(x), dimnames = dimnames(x)))
ks_fdrcut <- lapply(ks_resfdr, function(x) x[rowMin(x) <= 0.05, ])
#Filter FC calculated so that only genes with FC > 0 in at least one comparison is included
ks_fc <- lapply(ks_res, function(x) na.exclude(x[["FC"]]))
names(ks_fdrcut) <- names(ks_fc) <- paste0("M", 1:9)
ks_fccut <- lapply(names(ks_fc), function(x) ks_fc[[x]][rownames(ks_fc[[x]]) %in% rownames(ks_fdrcut[[x]]), ])
ks_fccut <- lapply(ks_fccut, function(x) x[abs(rowSums(x)) > 0, ])
#Combine FDR df from all modules into one df
all_ks <- bind_rows(lapply(ks_fdrcut, function(x) as.data.frame(cbind("gene" = rownames(x), x))))
rownames(all_ks) <- all_ks$gene
all_ks <- all_ks[, -1]
gene_names <- rownames(all_ks)
all_ks <- sapply(all_ks, function(x) as.numeric(as.character(x)))
rownames(all_ks) <- gene_names
#Annotate each gene to corresponding module
anno_row <- bind_rows(lapply(names(ks_fdrcut), function(x) as.data.frame(cbind("gene" = rownames(ks_fdrcut[[x]]), "module" = x) )))
rownames(anno_row) <- anno_row$gene
anno_col <- list(celltype = ggcolours)
#Calculate breaks in heatmap based on modules
indices <- sapply(ks_fdrcut, nrow)
for(i in 2:length(indices)) {
x <- as.numeric(indices[i])
indices[i] <- as.numeric(indices[i - 1]) + as.numeric(indices[i])
}
cell_anno <- as.data.frame(cbind("cluster" = 0:18, "celltype" = celltypes))
rownames(cell_anno) <- cell_anno$cluster
use_breaks <- get_htmap_breaks(all_ks)
p <- pheatmap(all_ks,
# border_color = "black",
color = colorRampPalette(c("darkmagenta", "snow2"))(250),
annotation_col = cell_anno[, 2, drop = F],
cluster_rows = F,
cluster_cols = T,
# scale = "row",
annotation_colors = anno_col,
gaps_row = indices,
show_rownames = F,
labels_col = paste(cell_anno$cluster, cell_anno$celltype)
# breaks = use_breaks2,
)
save_phtmap_pdf(p, "output_files/pit_utr_module_cheung_scrna_ks_FDR_cellenrichment.pdf", 8, 10)
print(p)
Cell clusters are clustered based on similarity.
tmp <- dev.off()
all_ks_fdr <- bind_rows(lapply(ks_fdrcut, function(x) as.data.frame(cbind("gene" = rownames(x), x))))
rownames(all_ks_fdr) <- all_ks_fdr$gene
all_ks_fdr <- all_ks_fdr[, -1]
gene_names <- rownames(all_ks_fdr)
all_ks_fdr <- sapply(all_ks_fdr, function(x) as.numeric(as.character(x)))
rownames(all_ks_fdr) <- gene_names
indices <- sapply(ks_fdrcut, nrow)
for(i in 2:length(indices)) {
x <- as.numeric(indices[i])
indices[i] <- as.numeric(indices[i - 1]) + as.numeric(indices[i])
}
cell_anno <- as.data.frame(cbind("cluster" = 0:18, "celltype" = celltypes))
rownames(cell_anno) <- cell_anno$cluster
all_ks_fdr_order <- all_ks_fdr[, order(cell_anno$celltype)]
p <- pheatmap(all_ks_fdr_order,
# border_color = "black",
color = colorRampPalette(c("darkmagenta", "snow3"))(250),
annotation_col = cell_anno[, 2, drop = F],
cluster_rows = F,
cluster_cols = F,
# scale = "row",
annotation_colors = anno_col,
gaps_row = indices,
show_rownames = F
# breaks = use_breaks,
)
save_phtmap_pdf(p, "output_files/pit_utr_module_cheung_scrna_ks_FDR_cellenrichment_cellordered.pdf", 8, 10)
print(p)
Cell clusters are clustered based on manual cell-type annotation.
tmp <- dev.off()
KS test is run and FDR is calculated for each module gene in each manually annotated cell-type. Genes with enriched with FC > 0, FDR adj < 0.05 in at least one cell-type compared to all other cell-types are retained. Heatmap is plotted with FDR value for each module gene retained in each cell-type.
#### KS test manual annotation ####
#Run KS test and calculate FDR for each module gene
ks_res <- lapply(levels(mod_utr$modules), function(x) calc_stat_test(mod_utr[mod_utr$modules == x, "genename"], "ks", seuratman, manclusters))
ks_respval <- lapply(ks_res, function(x) na.exclude(x[["pval"]]))
num_comparisons <- sum(unlist(lapply(ks_respval, function(x) print(as.numeric(dim(x)[1]) * as.numeric(dim(x)[2])))))
## [1] 4680
## [1] 1752
## [1] 1452
## [1] 1164
## [1] 900
## [1] 756
## [1] 852
## [1] 744
## [1] 504
#Calculate FDR adjusted pvalue and filter for minimum FDR adj < 0.05 in each row
ks_resfdr <- lapply(ks_respval, function(x) matrix(p.adjust(as.matrix(x), method = "fdr", n = sum(num_comparisons)), nrow = nrow(x), ncol = ncol(x), dimnames = dimnames(x)))
ks_fdrcut <- lapply(ks_resfdr, function(x) x[rowMin(x) <= 0.05, ])
#Filter FC calculated so that only genes with FC > 0 in at least one comparison is included
ks_fc <- lapply(ks_res, function(x) na.exclude(x[["FC"]]))
names(ks_fdrcut) <- names(ks_fc) <- paste0("M", 1:9)
ks_fccut <- lapply(names(ks_fc), function(x) ks_fc[[x]][rownames(ks_fc[[x]]) %in% rownames(ks_fdrcut[[x]]), ])
ks_fccut <- lapply(ks_fccut, function(x) x[abs(rowSums(x)) > 0, ])
#Combine FDR df from all modules into one df
all_ks <- bind_rows(lapply(ks_fdrcut, function(x) as.data.frame(cbind("gene" = rownames(x), x))))
rownames(all_ks) <- all_ks$gene
all_ks <- all_ks[, -1]
gene_names <- rownames(all_ks)
all_ks <- sapply(all_ks, function(x) as.numeric(as.character(x)))
rownames(all_ks) <- gene_names
#Annotate each gene to corresponding module
anno_row <- bind_rows(lapply(names(ks_fdrcut), function(x) as.data.frame(cbind("gene" = rownames(ks_fdrcut[[x]]), "module" = x) )))
rownames(anno_row) <- anno_row$gene
anno_col <- list(celltype = ggcolours)
#Calculate breaks in heatmap based on modules
indices <- sapply(ks_fdrcut, nrow)
for(i in 2:length(indices)) {
x <- as.numeric(indices[i])
indices[i] <- as.numeric(indices[i - 1]) + as.numeric(indices[i])
}
cell_anno <- as.data.frame(colnames(all_ks))
colnames(cell_anno) <- "celltype"
rownames(cell_anno) <- cell_anno$celltype
use_breaks <- get_htmap_breaks(all_ks)
p <- pheatmap(all_ks,
# border_color = "black",
color = colorRampPalette(c("darkmagenta", "snow2"))(250),
annotation_col = cell_anno[, 1, drop = F],
cluster_rows = F,
cluster_cols = T,
# scale = "row",
annotation_colors = anno_col,
gaps_row = indices,
show_rownames = F
# labels_col = paste(cell_anno$cluster, cell_anno$celltype),
# breaks = use_breaks2,
)
save_phtmap_pdf(p, "output_files/pit_utr_module_cheung_scrna_ks_FDR_cellenrichment_manual_anno.pdf", 8, 10)
print(p)
Cell-types are manually annotated.
tmp <- dev.off()
Signature matrix contains gene counts averaged across all the cells for a given manually annotated cell cluster. A factor of 1000 is used to ensure all the counts are integers (as required for CIBERSORT input). Signature matrix is then used with CIBERSORT script (Newman et al 2015) run on the CIBERSORT browser application (https://cibersort.stanford.edu/). CIBERSORT script is run with UTR-seq raw counts (after RUV-seq correction) and signature matrix with the following parameters: * permutations n = 100 * quantile normalization = F * absolute mode = F * absolute method = sig.score. For simplicity of running this script in one run, the CIBERSORT results are included in the input folder.
#### Make signature matrix ####
# with all gene expression and manually annotated cell clusters
pitman_countmat <- pit_seurat_manual[["RNA"]]@counts
colnames(pitman_countmat) <- pit_seurat_manual@active.ident
celltypes <- levels(pit_seurat_manual)
man_mean_mat <- sapply(celltypes, function(x) fill_seurat_mat_mean(x, pit_seurat_manual))
man_mean_mat <- man_mean_mat*1000
man_mean_mat <- apply(man_mean_mat, c(1,2), as.integer)
colnames(man_mean_mat) <- gsub(" ", "_", colnames(man_mean_mat))
colnames(man_mean_mat) <- gsub("/", "_", colnames(man_mean_mat))
man_mean_mat <- cbind("NAME" = rownames(man_mean_mat), man_mean_mat)
kable(man_mean_mat[1:5,], caption = "First 5 rows of signature matrix")
| NAME | Somatotropes | Lactotropes | Gonadotropes | Thyrotropes | Stem-cells_(Sox2)_FSC | Proliferating_Pou1f1 | Vascular_Endothelia | Corticotropes | Melanotropes_(intermediate) | Connective_tissue_VLMC | Immune_cells | Pituicytes_(posterior) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xkr4 | Xkr4 | 5 | 8 | 17 | 13 | 20 | 9 | 0 | 9 | 14 | 2 | 0 | 0 |
| Gm1992 | Gm1992 | 0 | 28 | 2 | 10 | 0 | 8 | 0 | 2 | 5 | 0 | 0 | 0 |
| Rp1 | Rp1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sox17 | Sox17 | 0 | 0 | 0 | 0 | 0 | 0 | 323 | 0 | 0 | 6 | 8 | 0 |
| Mrpl15 | Mrpl15 | 140 | 199 | 212 | 202 | 241 | 232 | 343 | 141 | 199 | 220 | 165 | 279 |
write.table(man_mean_mat, "output_files/seurat_v3_Cheungpit_manual_cluster_MEAN_gene_signature_matrix_2021-03-22.txt",
sep = "\t", quote = F, row.names = F, col.names = T)
Estimated cell proportions from CIBERSORT (https://cibersort.stanford.edu/) are plotted for each cell-type. For each cell-type, proportions are separated by age and sex. Wilcoxon test is performed between males and females at each age.
#### Plot CIBERSORT cell proportions ####
# CIBERSORT was run on the browser application
cib_res <- read.table("input_files/cibersort_browser_cheung_mean_rawcounts_2021-03-24/CIBERSORT.cibersort_browser_cheung_mean_rawcounts_2021-03-24.txt",
sep = "\t", header = T)
cib_plot <- tbl_df(melt(dplyr::select(cib_res, -Pearson.Correlation, -RMSE, -`P.value`)))
colnames(cib_plot) <- c("sample", "cell_type", "prop")
cib_plot <- mutate(cib_plot, age = substr(sample, 10, 11),
sex = substr(sample, 12, 12), rep = substr(sample, 13, 13))
cib_plot$age <- as.numeric(cib_plot$age)
gg_def_pal <- hue_pal()(length(levels(cib_plot$cell_type)))
lighter_cluster_colours <- sapply(gg_def_pal, function(x) lighten(x, amount = 0.5))
cluster_colours <- sapply(gg_def_pal, function(x) darken(x, amount = 0.2))
use_colors <- lapply(1:length(cluster_colours), function(x) c("F" = lighter_cluster_colours[[x]], "M" = cluster_colours[[x]]))
names(use_colors) <- levels(cib_plot$cell_type)
cell_prop_plots <- lapply(levels(cib_plot$cell_type), function(x) get_cell_prop_plot(x, cib_plot, use_colors))
## [1] "Somatotropes"
## [1] "Lactotropes"
## [1] "Gonadotropes"
## [1] "Thyrotropes"
## [1] "Stem.cells_.Sox2._FSC"
## [1] "Proliferating_Pou1f1"
## [1] "Vascular_Endothelia"
## [1] "Corticotropes"
## [1] "Melanotropes_.intermediate."
## [1] "Connective_tissue_VLMC"
## [1] "Immune_cells"
## [1] "Pituicytes_.posterior."
p <- ggarrange(plotlist = cell_prop_plots)
pdf("output_files/cibersort_browser_cell_proportions_raw_counts_2021-03-24.pdf", width = 14, height = 7, useDingbats = F)
print(p)
dev.off()
## quartz_off_screen
## 2
print(p)
CIBERSORT estimated cell propotions.
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] openxlsx_4.2.5 ggrepel_0.9.1
## [3] colorspace_2.0-2 reshape2_1.4.4
## [5] ggpubr_0.4.0 plyr_1.8.6
## [7] LaCroixColoR_0.1.0 pheatmap_1.0.12
## [9] EDASeq_2.28.0 ShortRead_1.52.0
## [11] GenomicAlignments_1.30.0 SummarizedExperiment_1.24.0
## [13] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [15] Rsamtools_2.10.0 GenomicRanges_1.46.1
## [17] Biostrings_2.62.0 GenomeInfoDb_1.30.0
## [19] XVector_0.34.0 IRanges_2.28.0
## [21] S4Vectors_0.32.3 BiocParallel_1.28.3
## [23] Biobase_2.54.0 BiocGenerics_0.40.0
## [25] scales_1.1.1 intergraph_2.0-2
## [27] ggraph_2.0.5 ggplot2_3.3.5
## [29] igraph_1.2.11 biomaRt_2.50.1
## [31] knitr_1.37 Seurat_3.2.3
## [33] dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 reticulate_1.22 R.utils_2.11.0
## [4] tidyselect_1.1.1 RSQLite_2.2.9 AnnotationDbi_1.56.2
## [7] htmlwidgets_1.5.4 Rtsne_0.15 munsell_0.5.0
## [10] codetools_0.2-18 ica_1.0-2 future_1.23.0
## [13] miniUI_0.1.1.1 withr_2.4.3 filelock_1.0.2
## [16] highr_0.9 ROCR_1.0-11 ggsignif_0.6.3
## [19] tensor_1.5 listenv_0.8.0 labeling_0.4.2
## [22] GenomeInfoDbData_1.2.7 hwriter_1.3.2 polyclip_1.10-0
## [25] bit64_4.0.5 farver_2.1.0 coda_0.19-4
## [28] parallelly_1.30.0 vctrs_0.3.8 generics_0.1.1
## [31] xfun_0.29 BiocFileCache_2.2.0 R6_2.5.1
## [34] graphlayouts_0.8.0 rsvd_1.0.5 bitops_1.0-7
## [37] spatstat.utils_2.3-0 cachem_1.0.6 DelayedArray_0.20.0
## [40] assertthat_0.2.1 promises_1.2.0.1 BiocIO_1.4.0
## [43] gtable_0.3.0 globals_0.14.0 goftest_1.2-3
## [46] tidygraph_1.2.0 rlang_0.4.12 splines_4.1.2
## [49] rstatix_0.7.0 rtracklayer_1.54.0 lazyeval_0.2.2
## [52] broom_0.7.11 yaml_2.2.1 abind_1.4-5
## [55] backports_1.4.1 GenomicFeatures_1.46.3 httpuv_1.6.5
## [58] tools_4.1.2 statnet.common_4.5.0 ellipsis_0.3.2
## [61] jquerylib_0.1.4 RColorBrewer_1.1-2 ggridges_0.5.3
## [64] Rcpp_1.0.7 progress_1.2.2 zlibbioc_1.40.0
## [67] purrr_0.3.4 RCurl_1.98-1.5 prettyunits_1.1.1
## [70] rpart_4.1-15 deldir_1.0-6 pbapply_1.5-0
## [73] viridis_0.6.2 cowplot_1.1.1 zoo_1.8-9
## [76] cluster_2.1.2 magrittr_2.0.1 RSpectra_0.16-0
## [79] data.table_1.14.2 scattermore_0.7 lmtest_0.9-39
## [82] RANN_2.6.1 fitdistrplus_1.1-6 aroma.light_3.24.0
## [85] hms_1.1.1 patchwork_1.1.1 mime_0.12
## [88] evaluate_0.14 xtable_1.8-4 XML_3.99-0.8
## [91] jpeg_0.1-9 gridExtra_2.3 compiler_4.1.2
## [94] tibble_3.1.6 KernSmooth_2.23-20 crayon_1.4.2
## [97] R.oo_1.24.0 htmltools_0.5.2 mgcv_1.8-38
## [100] later_1.3.0 tidyr_1.1.4 DBI_1.1.2
## [103] tweenr_1.0.2 dbplyr_2.1.1 MASS_7.3-54
## [106] rappdirs_0.3.3 car_3.0-12 Matrix_1.4-0
## [109] R.methodsS3_1.8.1 parallel_4.1.2 pkgconfig_2.0.3
## [112] plotly_4.10.0 xml2_1.3.3 bslib_0.3.1
## [115] stringr_1.4.0 digest_0.6.29 sctransform_0.3.2
## [118] RcppAnnoy_0.0.19 spatstat.data_2.1-2 rmarkdown_2.11
## [121] leiden_0.3.9 uwot_0.1.11 restfulr_0.0.13
## [124] curl_4.3.2 shiny_1.7.1 rjson_0.2.21
## [127] lifecycle_1.0.1 nlme_3.1-153 jsonlite_1.7.2
## [130] carData_3.0-5 network_1.17.1 viridisLite_0.4.0
## [133] fansi_1.0.0 pillar_1.6.4 lattice_0.20-45
## [136] KEGGREST_1.34.0 fastmap_1.1.0 httr_1.4.2
## [139] survival_3.2-13 glue_1.6.0 zip_2.2.0
## [142] spatstat_1.64-1 png_0.1-7 bit_4.0.4
## [145] ggforce_0.3.3 stringi_1.7.6 sass_0.4.0
## [148] blob_1.2.2 latticeExtra_0.6-29 memoise_2.0.1
## [151] irlba_2.3.5 future.apply_1.8.1